Machine Versus Human Clustering of Concepts Across Documents

Authors

  • Christopher S.G. Khoo
  • Shiyan Ou
Abstract

An automated method for clustering terms/concepts from a set of documents on the same topic was developed for the purpose of multi-document summarization. The clustering method combines lexical overlap between multiword terms, syntactic constraints, and semantic considerations based on a manually constructed taxonomy to generate hierarchically organized clusters of terms. This study evaluates the machine-generated clusters by calculating the proportion of overlap with two sets of human-generated clusters for 15 topics. It was found that the overlap between the machine-generated clusters and each individual set of human-generated clusters is higher than the overlap between the two sets of human-generated clusters. A qualitative analysis of the human clustering found that the clusters formed are either semantic-conceptual or lexical (the latter being similar to machine clustering). The semantic-conceptual clusters tended to differ from one human coder to another. This raises the question of whether machine-generated clustering can be evaluated by comparing it with human clustering.

Introduction

This paper reports a study of machine and human clustering of concepts across documents. The study was carried out in the context of multi-document summarization research, which aims to develop an automatic method for summarizing a set of related documents on a particular topic. Multi-document summarization involves identifying common information found in multiple documents, relations between the pieces of information, and unique information found in individual documents. Our approach to multi-document summarization focuses on extracting important terms/concepts from the documents and identifying the relations specified between the concepts. Thus, an important step in the summarization method is clustering similar or related concepts.
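The text here does not define the "proportion of overlap" used to compare machine and human clusters. As an illustration only, the sketch below assumes one common choice: the Jaccard overlap of term pairs that two clusterings both place in the same cluster. The function names and the example terms are hypothetical and are not taken from the study.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Return the set of unordered term pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Sort so that each unordered pair has one canonical representation.
        for a, b in combinations(sorted(set(cluster)), 2):
            pairs.add((a, b))
    return pairs


def pair_overlap(clusters_a, clusters_b):
    """Jaccard overlap of co-clustered pairs (an assumed measure, not the paper's)."""
    pairs_a = co_clustered_pairs(clusters_a)
    pairs_b = co_clustered_pairs(clusters_b)
    union = pairs_a | pairs_b
    if not union:
        return 0.0
    return len(pairs_a & pairs_b) / len(union)


# Hypothetical clusterings of the same five terms.
machine = [
    ["heart disease", "heart attack", "coronary disease"],
    ["diet", "low-fat diet"],
]
human = [
    ["heart disease", "coronary disease"],
    ["heart attack"],
    ["diet", "low-fat diet"],
]

print(pair_overlap(machine, human))  # 2 shared pairs out of 4 in the union
```

Under this assumed measure, the same function can score machine-versus-human and human-versus-human agreement on a common scale, which is the kind of comparison the evaluation relies on.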
We adopt a method for clustering concepts that uses a “global taxonomy”, developed manually from a corpus, in combination with a “local taxonomy”, a hierarchical structure of terms constructed from the set of related documents to be summarized. In this paper, we report the results of an evaluation that compares the clusters generated by the automatic method with clusters constructed by human coders. The objectives are to find out:

1. How “good” the machine-generated clusters are compared to human-generated clusters
2. The characteristics of human-generated clusters

Clustering and categorization are fundamental human behaviors. Although there have been many studies of human categorization in cognitive psychology, they have focused on the categorization of common objects and concepts. We have not found any work by information science researchers on human clustering of terms and concepts taken from documents. The human clustering research closest to ours is the card sorting studies sometimes carried out to develop menu hierarchies and taxonomies for organizing Web sites and information system interfaces. Bar-Ilan and Belous (2007) studied how children organized subject categories taken from Web

Publication date: 2008